On the criticality of inferred models 



Iacopo Mastromatteo^1 and Matteo Marsili^2
^1 International School for Advanced Studies, via Beirut 2/4, 34014, Trieste, Italy
^2 The Abdus Salam International Center for Theoretical Physics, Strada Costiera 11, 34014 Trieste, Italy

Advanced inference techniques allow one to reconstruct the pattern of interaction from high 
dimensional data sets, which probe simultaneously thousands of units of extended systems - such as 
cells, neural tissues or financial markets. We focus here on the statistical properties of inferred models 
and argue that inference procedures are likely to yield models which are close to singular values of 
parameters, akin to critical points in physics where phase transitions occur. These are points where 
the response of physical systems to external perturbations, as measured by the susceptibility, is very 
large and diverges in the limit of infinite size. We show that the reparameterization invariant metric
in the space of probability distributions of these models (the Fisher Information) is directly related
to the susceptibility of the inferred model. As a result, distinguishable models tend to accumulate 
close to critical points, where the susceptibility diverges in infinite systems. This region is the one 
where the estimate of inferred parameters is most stable. In order to illustrate these points, we 
discuss inference of interacting point processes with application to financial data and show that 
sensible choices of observation time-scales naturally yield models which are close to criticality. 

PACS numbers: 64.60.aq, 64.60.Cn, 89.75.Hc



The behavior of complex systems such as a cell, the 
brain or a financial market, is the result of the pattern 
of interaction taking place among its components. Tech- 
nological advances, either in experimental techniques or 
in data storage and acquisition, have made the micro 
scale at which the interaction takes place accessible to 
empirical analysis. Massive data, probing for instance
the expression of genes in a cell [1], the structure and
the interactions of proteins [2], the activity of neurons
in a neural tissue [3], or that of traders in financial
markets [5], is now available. This, in principle, makes the
reconstruction of the network of interactions at the micro 
scale possible. The reconstruction consists in inferring a 
model, specifying the wiring of the network of interac- 
tions between micro-units, as well as their strength. 

The typical situation is one where the micro-state is 
specified by a vector $s$, with component $s_i$ specifying
the state of unit $i$, and data consists of a sequence
$\hat{s} = \{s^{(t)}, t = 1, \ldots, T\}$ of $T$ samples. Under the assumption
that samples can be considered as independent, the
problem consists in estimating the probability distribu- 
tion of s, in a way which allows for robust generalization, 
i.e. for the generation of yet unseen samples. 

By way of illustration, it is instructive to discuss
a specific example. Prices of stocks in a financial mar- 
ket move in a correlated fashion. This correlation arises 
from the correlated activity of traders buying and selling 
the different stocks. So for example, a particular activity 
pattern on stock 1 may be interpreted as revealing some 
information to traders, which may induce them to trade 
stock 2. One way to formalize this idea in a statistical
model is to fix a time interval $\Delta t$ and define a binary
variable on each stock which takes value $+1$ if a trade
occurred in that interval, or $-1$ otherwise. In this way,
the activity of a stock market with $N$ stocks is represented
as a string of $N$ "spins" $s_i = \pm 1$, and repeated
measurements produce $T$ samples of $s$.

In spite of its abundance, data is far from being able to 
completely identify the correct model, and one is left with 
a complex inference problem. This is because the number
of available samples is far smaller than the number of
possible microscopic states. In our workhorse example,
even reducing attention to the $N \sim 100$ most traded
stocks, and at very high frequency $\Delta t = 30$ s, one year of
data amounts to $T \approx 10^5$ samples, whereas the number
of possible micro-states is $2^{100} \approx 10^{30}$.

This problem is addressed in statistical learning the- 
ory in two steps: i) model selection and ii) inference of 
parameters. Boltzmann learning [6] addresses i) by first
identifying those empirical quantities which we want the
model to reproduce and then invoking the principle of
maximum entropy [7]. So, for example, if correlated activity
on stocks is the result of interaction mediated by
traders, and we assume that traders react to movement
on single stocks, it is natural to require that our model
reproduces the observed pairwise correlations between
stocks. If we require that the distribution of $s$ reproduces
the measured values of a collection $\boldsymbol{\phi} = \{\phi^\mu(s)\}$ of $M$
functions of the micro-state $s$, then maximal entropy prescribes
distributions of the exponential form:

$$p(s|g) = \exp\left[\sum_{\mu=0}^{M} g^\mu \phi^\mu(s)\right], \qquad (1)$$

where $g = \{g^\mu : \mu = 0, \ldots, M\} \in G$ are the parameters of
the model, to be inferred in ii) (see below).
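
To make Eq. (1) concrete, the following minimal sketch enumerates all states of a small system and normalizes the exponential weights. It is only an illustration: the function name `maxent_distribution`, the two example operators and the coupling values are assumptions chosen here, and brute-force enumeration is feasible only for a handful of spins.

```python
import itertools
import numpy as np

def maxent_distribution(couplings, operators, n_spins):
    """Return all 2^n states and p(s|g) = exp(sum_mu g^mu phi^mu(s)) / Z."""
    states = np.array(list(itertools.product([-1, 1], repeat=n_spins)))
    # Log-weight of each state: sum over the operators mu > 0 of g^mu * phi^mu(s).
    log_w = np.zeros(len(states))
    for g_mu, phi_mu in zip(couplings, operators):
        log_w += g_mu * np.array([phi_mu(s) for s in states])
    # The constant operator phi^0 = 1 carries g^0 = -log Z, i.e. the normalization.
    log_z = np.log(np.sum(np.exp(log_w)))
    return states, np.exp(log_w - log_z)

# Hypothetical example: 3 spins, one pairwise operator and one single-spin operator.
operators = [lambda s: s[0] * s[1], lambda s: s[2]]
states, probs = maxent_distribution([0.5, -0.2], operators, n_spins=3)
```

In this toy case the two operators act on disjoint spins, so the model averages can be checked by hand: the pair correlation equals tanh(0.5) and the mean of the third spin equals tanh(-0.2).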

Our focus here is neither on model selection nor on the
inference procedure. Rather, we focus on the statistical 
properties which we expect to observe in inferred models, 
and argue that there are reasons to expect them to be 
very peculiar. 

Probability distributions of the form (1), in the limit
$N \to \infty$, have been the object of enquiry in statistical
mechanics since its very beginning, in particular as a
function of "temperature", which in physics modulates
the strength of the interactions between variables. A fictitious
(inverse) temperature can be introduced with the
replacement $g \to \beta g$. Then $p(s|\beta g)$ is expected to interpolate
between a "low temperature" behavior ($\beta \to \infty$),
where the distribution is concentrated on few states and
the $s_i$ are strongly correlated, and a "high temperature"
behavior ($\beta \to 0$), where the different components of $s$
are very weakly dependent. These two polar behaviors
often do not morph continuously into each other as $\beta$
varies in $[0, \infty)$, but rather do so in a sharp manner,
in a small neighborhood of a critical inverse temperature
$\beta_c$. The critical point $\beta_c$ is characterized by the fact
that fluctuations - corresponding e.g. to specific heat in
physics - become very large.

Remarkably, inference procedures often produce models
which seem to be "poised close to a critical point",
i.e. for which fluctuations are maximal for $\beta \approx 1$, suggesting
that $\beta_c \approx 1$. This was first observed in [5] for
the activity of neuronal tissues and in [5] for the statistics
of natural images. Fig. 1 presents similar evidence for
the activity pattern of 100 stocks of the New York Stock 
Exchange at high frequency (see caption and discussion 
below for details). 

Critical models of the form (1) seem to be rather special
in physics, since they arise only when the parameters
are fine tuned to a set of zero measure. In spite of this,
they have attracted and still attract considerable interest,
with much effort being devoted to elucidating their properties.
On the contrary, critical models seem to arise ubiquitously
in the analysis of complex systems [10], evoking
theories of Self-Organized Criticality [11].

Leaving aside the mechanisms by which a real system
self-organizes close to a critical point, we address here 
the issue from a purely statistical point of view. So, for 
example, when can one conclude in a statistically signif- 
icant manner that a given system is critical? And then, 
can criticality be induced by the inference procedure? 

Specifically, drawing from results of information geometry
[12, 13], we argue that when the distance in the space
of models is properly defined in a reparametrization invariant
manner, one finds that the number of statistically
distinguishable models accumulates close to the region in 
parameter space where models are critical. Conversely, 
models far from the critical points can hardly be distin- 
guished. Loosely speaking, models that can be inferred 
are only in a finite region around critical points. This 
implies that when the distance from the critical point 
is measured in terms of distinguishable models, inferred 
models turn out to be typically much further away from 
criticality than what the distance of estimated param- 
eters from criticality would suggest. This provides an 
alternative characterization of criticality, whose relation
with information theory was earlier investigated in [14] in
the context of thermodynamic fluctuation theory, and in
[15] in relation to quantum phase transitions. Indeed our
discussion just relies on basic properties of statistical mechanics
and large deviation theory, and does not require
any specific assumption about the model it is applied to.

In what follows, we address the problem of the infer- 
ence of a probability distribution over a set of binary 
variables. We recall [12] that the statistical distinguishability
of empirical distributions naturally leads to the notion
of curvature in the space of probability distributions.
Then we show that curvature is related to susceptibility 
of the corresponding models. We apply these ideas to the 
model of a fully connected ferromagnet, which despite being
simple enough to provide a tractable solution, realizes
all the features previously described. We illustrate the
points above by specializing to inference of data produced
by "fully connected" Hawkes point-processes [16]. This
shows that when a fictitious temperature is introduced
in estimated models, a maximum of the specific heat of
the corresponding Ising ferromagnet naturally arises for
$\beta \approx 1$.

FIG. 1: Specific heat as a function of the inverse temperature
$\beta$ for financial data, for various choices of the bin sizes (lines
from the top to the bottom correspond respectively to $\Delta t$ =
30, 28, 26, 24 s).



DISTINGUISHABILITY OF STATISTICAL 
MODELS 

Given a set $\hat{s} = \{s^{(t)}, t = 1, \ldots, T\}$ of $T$ observations
of a string of $N$ binary variables $s \in \{-1, 1\}^N$, we consider
the problem of estimating from empirical data a
statistical model $\mathcal{M} = \{\boldsymbol{\phi}, G\}$ defined by a probability
distribution as in Eq. (1), where $\boldsymbol{\phi} = \{\phi^\mu(s)\}_{\mu=0}^{M}$ is a
collection of functions of the vector $s \in \{-1, 1\}^N$ and
$g = \{g^\mu : \mu = 0, \ldots, M\} \in G$ are the corresponding parameters.
With the choice $\phi^0(s) = 1$, the normalization
condition fixes $g^0(g)$ to be the free energy of the Hamiltonian
$H(s) = -\sum_{\mu>0} g^\mu \phi^\mu(s)$ at temperature equal to
one. We shall denote by $\langle \ldots \rangle_g$ averages taken over the
distribution $p(s|g)$.

We briefly recall the arguments of Ref. [12] to assess
whether $g$ and $g'$ are distinguishable. Imagine that the
inference procedure returns $g$ as the optimal parameters
of the distribution and consider resampling a set $\hat{s}_g$ of
$T$ observations from $g$. By Sanov's theorem [17], the
probability that the empirical distribution of the sample
$\hat{s}_g$ generated by $T$ i.i.d. draws from $p(s|g)$ falls in
a close neighborhood of $g'$ is given in the large $T$ limit
by $p(\hat{s}_g|g') \sim e^{-T D(g\|g')}$, where $D(g\|g') = \left\langle \log \frac{p(s|g)}{p(s|g')} \right\rangle_g$
is the Kullback-Leibler distance of the two distributions.
Requiring that this probability be less than a threshold $\epsilon$,
for $g$ and $g'$ to be distinguishable, implies $D(g\|g') > \kappa/T$
with $\kappa = -\log \epsilon$. The complementary condition
$D(g\|g') \le \kappa/T$ identifies a volume of
parameters around $g$ of distributions which cannot be
distinguished from $g$, for a finite data set. Since we assumed
$T \gg 1$, this volume can be computed from the
expansion



$$D(g\|g+\eta) = \frac{1}{2} \sum_{\mu,\nu} \eta^\mu \chi_{\mu\nu} \eta^\nu + O(\eta^3),$$

where $\chi$ is the matrix of second derivatives of $D(g\|g+\eta)$
computed at $\eta = 0$, and is known as the Fisher Information
(FI). The volume of distributions which are indistinguishable
from $g$ is given by [12]:



$$\Delta V_{T,\kappa}(g) = \left(\frac{2\pi\kappa}{T}\right)^{M/2} \frac{1}{\Gamma(M/2+1)\sqrt{\det\chi}}. \qquad (2)$$
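
As a quick illustration of Eq. (2), the sketch below evaluates the volume of indistinguishable models for a given FI matrix. The function name `indistinguishable_volume` and its argument names are assumptions made here for illustration; the formula is simply the volume of the ellipsoid defined by the quadratic expansion of the Kullback-Leibler distance with threshold $\kappa/T$.

```python
import math
import numpy as np

def indistinguishable_volume(fisher, n_samples, kappa):
    """Volume of the region {eta : eta^T chi eta / 2 <= kappa / T} around g."""
    fisher = np.atleast_2d(np.asarray(fisher, dtype=float))
    m = fisher.shape[0]  # number of parameters M
    prefactor = (2.0 * math.pi * kappa / n_samples) ** (m / 2.0)
    return prefactor / (math.gamma(m / 2.0 + 1.0) * math.sqrt(np.linalg.det(fisher)))

# One parameter with unit FI: the "volume" is the length of the interval
# |eta| <= sqrt(2 kappa / T), i.e. 2 * sqrt(2 kappa / T).
length = indistinguishable_volume([[1.0]], n_samples=1000, kappa=2.0)
```

The volume shrinks as the number of samples or the determinant of the FI grows, which is the sense in which a large FI makes neighboring models easier to tell apart.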



In the language of statistical mechanics, FI corresponds 
to a generalized susceptibility, and via fluctuation- 
dissipation relations, to the covariance of operators 



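$$\chi_{\mu\nu} = \frac{\partial \langle \phi^\mu \rangle_g}{\partial g^\nu} = -\frac{\partial^2 g^0}{\partial g^\mu \partial g^\nu} = \langle \phi^\mu \phi^\nu \rangle_g - \langle \phi^\mu \rangle_g \langle \phi^\nu \rangle_g.$$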

Models with a large FI correspond to models with high
susceptibility, for which the error on the estimated couplings
is small. More precisely, the Cramér-Rao bound
[17] states that given a set of $T$ independent observations
and an unbiased estimator $g^*$ of the couplings, its
covariance matrix satisfies $\mathrm{Cov}(g^*) \ge \frac{1}{T}\chi^{-1}$, where the
notation $A \ge B$ indicates that the matrix $A - B$ is positive
semidefinite.
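
The identification of the FI with the covariance of the operators can be checked numerically on a small system. The sketch below is an illustration under stated assumptions: it uses a toy model with two operators, $(1/N)\sum_{i<j} s_i s_j$ and $\sum_i s_i$, with couplings called `J` and `h`, and the helper names are hypothetical. It computes the covariance matrix by exact enumeration and compares the Kullback-Leibler distance between two nearby models with the quadratic form $\frac{1}{2}\eta^T \chi \eta$.

```python
import itertools
import numpy as np

def state_probabilities(J, h, n_spins):
    """Exact p(s|g) and operator values for the toy two-operator model."""
    states = np.array(list(itertools.product([-1, 1], repeat=n_spins)))
    total = states.sum(axis=1)                        # sum_i s_i
    pair = (total**2 - n_spins) / (2.0 * n_spins)     # (1/N) sum_{i<j} s_i s_j
    log_w = J * pair + h * total
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    return p, np.stack([pair, total])

def fisher_information(J, h, n_spins):
    """FI matrix as the covariance of the operators under p(s|g)."""
    p, phi = state_probabilities(J, h, n_spins)
    centered = phi - (phi @ p)[:, None]
    return (centered * p) @ centered.T

def kl_divergence(p, q):
    """Kullback-Leibler distance D(p||q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

chi = fisher_information(0.5, 0.1, n_spins=6)
eta = np.array([1e-3, -2e-3])                         # small parameter shift
p, _ = state_probabilities(0.5, 0.1, 6)
q, _ = state_probabilities(0.5 + eta[0], 0.1 + eta[1], 6)
kl, quadratic = kl_divergence(p, q), 0.5 * eta @ chi @ eta
```

For small shifts the two numbers `kl` and `quadratic` agree up to corrections of third order in the shift.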

Summarizing, the FI provides a parameterization invariant
metric in the space of statistically distinguishable
distributions, $d\tau = \prod_\mu dg^\mu \sqrt{\det \chi(g)}$. This measure
concentrates around the "critical" points $g$ where the
susceptibility is large (or diverges for $N \to \infty$), which
correspond to points where estimates of parameters are
more precise. On the contrary, since the susceptibility
decreases fast away from critical points, the volume element
$d\tau$ is expected to be non-negligible only in a bounded
region of space. The outcome of the inference procedure
can be considered meaningful when the susceptibility is 
sufficiently large, or equivalently, when the error in the 
inferred coupling is small enough. This suggests a maxi- 
mum distance from the critical point at which parameters 
can be inferred [T5] . 



The case of fully connected spin models 

In order to illustrate these concepts, let us consider the
simple case of a fully connected ferromagnet characterized
by the operators $\boldsymbol{\phi} = \{1, \frac{1}{N}\sum_{i<j} s_i s_j, \sum_i s_i\}$ and
the corresponding couplings $g = \{-\log Z, J, h\}$. The calculation
of the FI is straightforward and, for $N \gg 1$,
to leading order, one finds that the invariant measure of
distinguishable distributions is given by

$$d\tau = \left[\sqrt{\frac{N}{2}}\,\chi_s^{3/2} + A(J)\,\delta(h)\right] dJ\,dh,$$

where $\chi_s = \frac{\partial m}{\partial h}$ is the spin susceptibility
and $m(J,h) = \tanh[J m(J,h) + h]$. The $\delta(h)$ contribution,
with a $J$-dependent amplitude $A(J)$, arises from the
discontinuity of $m$ at $h = 0$ in the ferromagnetic region $J > 1$.

In the highly magnetized region ($J \gg 1$, $h \neq 0$),
the non-singular contribution to the density of states
$d\tau \approx \sqrt{32N}\, e^{-3(J+|h|)} dJ\,dh$ can be explicitly integrated
to obtain the number of distinguishable states in a finite
region of the phase space. For example, the number
of distinguishable states in the semiplane $J > J_{\max}$
stripped of the $h \approx 0$ line, given $T$ observations and a
threshold $\kappa$, is $D \approx \frac{4\sqrt{2}}{9\pi\kappa}\, T\sqrt{N}\, e^{-3 J_{\max}}$. This means that
it is not possible to meaningfully infer any value of $J$
greater than $J_{\max} \approx \frac{1}{3}\log\left(\frac{4\sqrt{2}}{9\pi\kappa}\, T\sqrt{N}\right)$. Under the hypothesis
that $h = 0$, instead, we find that $J_{\max} \sim \log\sqrt{TN}$.
The volume element $d\tau$ diverges in a non-integrable manner
close to the critical point $(J, h) = (1, 0)$. For $h = 0$
we find $d\tau \sim |1 - J|^{-3/2}$, while approaching the critical
point on the line $J = 1$ one finds the milder divergence
$d\tau \sim |h|^{-1}$. Hence, there is a macroscopic number of
models located in an infinitesimal region around the critical
point $(J, h) = (1, 0)$. The singularity is smeared by
finite size effects when $N < +\infty$, but it retains the main
characteristics discussed above. A plot of the density of
models for $N = 100$ is shown in Fig. 2.
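
The density of models at finite $N$ can also be evaluated exactly, without any large-$N$ expansion. The following sketch is illustrative only: it assumes the fully connected operators $(1/N)\sum_{i<j} s_i s_j$ and $\sum_i s_i$, exploits the fact that both depend on the configuration only through the total spin, and uses the hypothetical function name `exact_density`. It returns $\sqrt{\det\chi}$ for the two couplings $(J, h)$.

```python
from math import lgamma
import numpy as np

def exact_density(J, h, n_spins):
    """sqrt(det chi) of the fully connected model, summing over the total spin."""
    k = np.arange(n_spins + 1)                        # number of up spins
    total = 2.0 * k - n_spins                         # sum_i s_i
    pair = (total**2 - n_spins) / (2.0 * n_spins)     # (1/N) sum_{i<j} s_i s_j
    # Log of the binomial multiplicity of each value of the total spin.
    log_mult = np.array([lgamma(n_spins + 1) - lgamma(i + 1) - lgamma(n_spins - i + 1)
                         for i in k])
    log_w = log_mult + J * pair + h * total
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    phi = np.stack([pair, total])
    centered = phi - (phi @ p)[:, None]
    chi = (centered * p) @ centered.T                 # covariance of the operators
    return float(np.sqrt(max(np.linalg.det(chi), 0.0)))

# Density at the critical point versus two points away from it (N = 100).
near = exact_density(1.0, 0.0, 100)
paramagnetic = exact_density(0.2, 0.0, 100)
magnetized = exact_density(3.0, 0.5, 100)
```

Scanning `exact_density` over a grid of $(J, h)$ gives a finite-size picture of how the density concentrates around $(J, h) = (1, 0)$.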

APPLICATION TO HAWKES PROCESSES AND
REAL DATA

In order to investigate the implications of this pic- 
ture on the inference of models from data we address the 
specific case of synthetic data generated by Hawkes processes
[TB]. An $N$ dimensional Hawkes process is a generalized
Poisson process where the probability of events
$P\{dN_t^i = 1 | N_t\} = \lambda_t^i dt$ in an infinitesimal time interval
$dt$ depends on a rate $\lambda_t^i$ which is itself a stochastic
variable

$$\lambda_t^i = \mu^i + \sum_{j=1}^{N} \int_{-\infty}^{t} K^{ij}(t - u)\, dN_u^j,$$

which depends on the past realization of the process (here
$\mu^i > 0$, $K^{ij} > 0$). This process reproduces a cross-excitatory
interaction among the different channels, akin
to that occurring between stocks in a financial market
[20] or neurons in the brain [21]. For our purposes, it
will serve as a very simple toy model to generate data
with controlled statistical properties, of a similar type to
that collected in more complex situations. In fact, the
linearity of the model makes it possible to derive analytically
some properties in the stationary state. We focus
on a fully connected version of the Hawkes process, with
$\mu^i = \mu$ and $K^{ij}(\tau) = \frac{\alpha}{N} e^{-\nu\tau}$. The expected number of
events per unit time is $\langle \lambda_t^i \rangle = \frac{\mu\nu}{\nu - \alpha}$, and it diverges for
$\alpha \to \nu$. We remark that this singularity is not a proper
phase transition, as it occurs for any finite $N$.
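
A fully connected Hawkes process of this kind is straightforward to simulate. The sketch below is a minimal illustration, not the procedure used for the figures: it assumes an exponential kernel $(\alpha/N) e^{-\nu\tau}$ acting between all pairs of channels (self-excitation included), so that every channel shares the same rate, and it uses the standard thinning method of Ogata. The function names `simulate_hawkes` and `stationary_rate` are hypothetical.

```python
import numpy as np

def stationary_rate(mu, alpha, nu):
    """Stationary rate per channel, mu * nu / (nu - alpha), valid for alpha < nu."""
    return mu * nu / (nu - alpha)

def simulate_hawkes(n_channels, mu, alpha, nu, t_max, seed=0):
    """Thinning simulation; returns event times and the channel of each event."""
    rng = np.random.default_rng(seed)
    t, excitation = 0.0, 0.0          # excitation = common rate minus mu
    times, channels = [], []
    while True:
        # The rate only decays between events, so its current value is an upper bound.
        rate_bound = n_channels * (mu + excitation)
        wait = rng.exponential(1.0 / rate_bound)
        t += wait
        if t >= t_max:
            break
        excitation *= np.exp(-nu * wait)
        total_rate = n_channels * (mu + excitation)
        if rng.random() * rate_bound <= total_rate:   # accept the candidate event
            times.append(t)
            channels.append(int(rng.integers(n_channels)))
            excitation += alpha / n_channels          # each event raises every rate
    return np.array(times), np.array(channels)

times, channels = simulate_hawkes(n_channels=10, mu=1.0, alpha=0.0, nu=1.0,
                                  t_max=200.0, seed=1)
```

With `alpha=0` the process reduces to independent Poisson channels, which gives a simple sanity check on the event count; increasing `alpha` towards `nu` raises the rate as described by `stationary_rate`.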

We also estimate the activity pattern of an ensemble 
of 100 stocks of the NYSE market (see [52] for details on the
dataset). We consider the jump process defined by the
function $N_t^i$ which counts the number of times in which
stock $i$ is traded in the time interval $[0,t]$, disregarding
the (buy or sell) direction of the trade. Data refers to 100
trading days (from 02.01.2003 to 05.30.2003), of which
only the $10^4$ seconds of the central part of the day were
retained, in order to avoid the non-stationary effects due 
to the opening and closing hours of the financial market 
(see [10]). 

Following [3], we map a data-set of events into a sequence
of spin configurations, by fixing a time interval
$\Delta t$ and setting $s_i^{(t)} = +1$ if $\Delta N_t^i = N_{t+\Delta t}^i - N_t^i > 0$ and
$s_i^{(t)} = -1$ if no event on channel $i$ occurred in $[t, t + \Delta t)$.
The choice of $\Delta t$ fixes not only the number of data points
$T = U/\Delta t$, where $U = 10^6$ seconds is the total length of
the time series. It also fixes the scale at which the dynamics
of the system is observed: for $\Delta t \to 0$ the system
is non-interacting, and can be successfully described by
an independent model [23], while after a certain time
scale correlations start to emerge [24]. Indeed the product
of the bin size $\Delta t$ with the event rate $\lambda$ also controls
the average magnetization of the system, which can accordingly
be tuned from $-1$ to $1$. Hence, as $\Delta t$ varies,
the inferred model performs a trajectory in the space of
couplings.
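
The mapping from events to spins described above can be sketched as follows. The function name `events_to_spins` and its arguments are assumptions made for illustration; events falling beyond the last complete bin are simply discarded.

```python
import numpy as np

def events_to_spins(times, channels, n_channels, bin_size, total_time):
    """Return a (T, N) array with +1 where a channel had an event in a bin, else -1."""
    n_bins = int(total_time // bin_size)              # T = U / bin size
    spins = -np.ones((n_bins, n_channels), dtype=int)
    bin_index = (np.asarray(times, dtype=float) // bin_size).astype(int)
    channel_index = np.asarray(channels, dtype=int)
    keep = bin_index < n_bins                         # drop events past the last bin
    spins[bin_index[keep], channel_index[keep]] = 1
    return spins

# Four events on two channels, bins of one second over four seconds.
spins = events_to_spins([0.5, 1.2, 1.7, 3.9], [0, 1, 1, 0],
                        n_channels=2, bin_size=1.0, total_time=4.0)
```

Several events of the same channel in one bin collapse onto a single $+1$, which is why the binary representation loses information when the bin size becomes large.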

We fit both data sets with a model of pairwise interacting
Ising spins, with operators $\boldsymbol{\phi} = \{1\} \cup \{s_i\}_{i=1}^{N} \cup \{s_i s_j\}_{i<j=1}^{N}$
and the corresponding couplings $g = \{-\log Z\} \cup
\{h_i\}_{i=1}^{N} \cup \{J_{ij}\}_{i<j=1}^{N}$. Several approximate schemes have
been proposed to compute efficiently the maximum en- 
tropy estimate of the couplings j| . Here we resort to 
mean-field theory, which turns out to give results which
are consistent with more sophisticated schemes.
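
The text does not specify which mean-field scheme is used. As a hedged illustration, the sketch below implements the simplest variant, the naive mean-field inversion, in which the couplings are read off from the inverse of the connected correlation matrix, $J_{ij} = -(C^{-1})_{ij}$ for $i \neq j$, and the fields follow from $h_i = \tanh^{-1}(m_i) - \sum_j J_{ij} m_j$. The function name `naive_mean_field` is hypothetical.

```python
import numpy as np

def naive_mean_field(spins):
    """Naive mean-field estimate of (J_ij, h_i) from a (T, N) array of +/-1 spins."""
    spins = np.asarray(spins, dtype=float)
    m = spins.mean(axis=0)                            # magnetizations m_i
    c = np.cov(spins, rowvar=False, bias=True)        # connected correlations C_ij
    couplings = -np.linalg.inv(c)                     # J_ij = -(C^{-1})_ij, i != j
    np.fill_diagonal(couplings, 0.0)
    # arctanh diverges if a spin never flips (m_i = +/-1); such units must be removed.
    fields = np.arctanh(m) - couplings @ m
    return couplings, fields

# Two positively correlated spins with zero magnetization.
data = [[1, 1], [1, 1], [-1, -1], [-1, -1], [1, -1], [-1, 1]]
J_est, h_est = naive_mean_field(data)
```

In this small example the correlation between the two spins is 1/3, and the inversion gives a coupling of 0.375 with vanishing fields.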

Fig. 2 reports the results of the inference on simulated
Hawkes processes and on financial data (see caption
for details). Given the inferred parameters, we
compute the average couplings $J$ and $h$, obtained from the
normalized sums $\sum_{i<j} J_{ij}$ and $\sum_i h_i$, and report the
trajectory of the point $(J, h)$
as $\Delta t$ varies, for both cases. Fitting a fully connected
model $\boldsymbol{\phi} = \{1, \frac{1}{N}\sum_{i<j} s_i s_j, \sum_i s_i\}$, $g = \{-\log Z, J, h\}$
produces essentially the same results [25].

The region of $\Delta t$ where non-trivial correlations are
present but where the binary representation of events
is still meaningful [23] corresponds to the region where
the trajectory $(J, h)$ is closest to the critical point of a
fully connected ferromagnet $(J, h) = (1, 0)$. By the Cramér-Rao
bound, this is also the region of $\Delta t$ where the inferred
couplings are likely subject to the smallest errors.
For Hawkes processes, a time interval $\Delta t$ smaller than
$1/\nu$ does not allow the process to develop correlations,
and for intervals $\Delta t$ much longer than the typical time
between events the binary nature of the
process is lost [25]. In addition, i) increasing values of
the interaction parameter $\alpha$ lead to a sequence of curves
in the phase diagram which monotonically approach the
critical point; ii) the mean coupling $J$ increases with bin
size $\Delta t$, except for a small region of the parameter space
in the case $\mu > \nu$; iii) $h$ is not monotonic with $\Delta t$. In
particular for $\alpha > \nu/2$ it decreases for $\Delta t$ large. Interestingly,
the points which are inferred can correspond both
to stable and metastable states. The latter occurs, for
Hawkes processes, for large $\Delta t$ and $\alpha > \nu/2$.

Inference of financial data results in a trajectory similar
to that of Hawkes processes with $\alpha > \nu/2$ (see Fig. 2).

In all these cases, one can introduce a fictitious inverse
temperature $\beta$, rescale the inferred couplings as
$(J_{ij}, h_i) \to (\beta J_{ij}, \beta h_i)$, and analyze the corresponding
statistical behavior. The fluctuations of observables as
a function of $\beta$ provide an indication of the proximity
of the inferred model to a critical point. In Fig. 1 we
plot the specific heat $c_V = \beta^2 \mathrm{Var}[H(s)]$ for various bin
sizes in the case of financial data. For a given value of $\Delta t$,
varying the inverse temperature $\beta$ corresponds to moving
on the line passing through the origin and the inferred
point (white line in Fig. 2). If $\Delta t$ is in the region close
to the critical point, for the reasons stated above, then
fluctuations will be maximal for $\beta \approx 1$. We remark, however,
that such a notion of proximity to a critical point is
only apparent. The distance from the critical point evaluated
using $\beta$ is not invariant under reparametrization of
the couplings: the number of distinguishable models in a
given interval of temperature is not constant throughout
the space of couplings.

FIG. 2: Mean couplings $(J, h)$ produced by the inference procedure
in various cases and with various bin sizes. Orange,
red, purple and blue points correspond to the inferred values
for a simulated Hawkes process with $\nu = 0.3$, $\mu = 0.1$ and $\alpha$
respectively equal to (0, 0.075, 0.15, 0.225). Boxes and circles
correspond respectively to the fit of a fully connected
model (2 parameters, $J$ and $h$) and to the mean couplings
for a spin glass ($N$ fields $h_i$ and $N(N-1)/2$ couplings $J_{ij}$).
In each of those processes we considered $N = 100$ channels
producing 5000 events each. The dashed lines correspond to
theoretical, approximate predictions for the inferred couplings
of those processes at $T = \infty$. The black points correspond to
the values obtained for $U = 10^6$ seconds of financial data corresponding
to the activity pattern of 100 stocks of the NYSE. In
the background, the density of models for a fully connected
model is also plotted for the sake of comparison. The white
line intersects the origin and the inferred values of $(J, h)$ at
$\Delta t = 18$ s: for such a choice of the bin size, a fully connected
model would have the maximum density of models exactly at
$\beta = 1$.
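
The temperature scan behind a specific heat curve can be sketched for a fully connected model, where the sum over configurations reduces to a sum over the total spin. This is an illustration under stated assumptions, not the computation used for Fig. 1: the Hamiltonian is taken as $H(s) = -\frac{J}{N}\sum_{i<j} s_i s_j - h \sum_i s_i$, the couplings $(J, h)$ are rescaled by $\beta$, and the function name `specific_heat` is hypothetical.

```python
from math import lgamma
import numpy as np

def specific_heat(beta, J, h, n_spins):
    """c_V = beta^2 Var[H] for the fully connected model at inverse temperature beta."""
    k = np.arange(n_spins + 1)                        # number of up spins
    total = 2.0 * k - n_spins                         # sum_i s_i
    energy = -(J * (total**2 - n_spins) / (2.0 * n_spins) + h * total)
    log_mult = np.array([lgamma(n_spins + 1) - lgamma(i + 1) - lgamma(n_spins - i + 1)
                         for i in k])
    log_w = log_mult - beta * energy                  # couplings rescaled by beta
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    mean_energy = p @ energy
    return float(beta**2 * (p @ (energy - mean_energy) ** 2))

# Scan beta along the line through the origin and the point (J, h) = (1, 0).
betas = np.linspace(0.5, 2.5, 201)
curve = np.array([specific_heat(b, 1.0, 0.0, 100) for b in betas])
beta_peak = betas[int(np.argmax(curve))]
```

For couplings placed at the mean-field critical point the maximum of the curve falls near $\beta = 1$, rounded and slightly shifted by the finite size of the system. As stressed above, the position of this maximum alone does not measure the distance from criticality in terms of distinguishable models.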



DISCUSSION 

In summary, we have shown that the measure of distinguishable
distributions in a parametric family of models
is directly related to the susceptibility of the corresponding
model in statistical mechanics. As a consequence,
this measure exhibits a singular concentration at critical
points. One may speculate that, if experiments are designed
(or data-sets collected) in order to be maximally
informative, they should return data which sample uniformly
the space of distributions. This, as we have seen,
corresponds to sampling a measure in parameter space
which is sharply peaked at critical points. Hence, in- 
ference of data from maximally informative experiments 
(see [22] for a survey) is likely to return parameters close 
to critical points with high probability. 

As stated in the introduction, critical points separate 
a region (or "phase") of weak interaction, where the 
different components behave in an essentially uncorre- 
lated fashion, from a strongly interactive phase, where 
the knowledge of the microscopic variables in one part 
of the system fixes the state also in far away regions of 
the system. The critical point shares properties of both 
phases. It has the largest possible entropy consistent with 
system wide coherence. Often, system wide coherence is 
implicitly enforced by the fact that we construct data- 
sets with elements we believe to be mutually dependent 
or causally related. We would hardly analyze a data-set
with uncorrelated variables (e.g. the activity of a cell, 
planetary motion and fluctuations in financial markets). 
Therefore, criticality might not only come from the ac- 
tual dynamics of the system, but it might be in the eyes 
of those who are trying to infer the underlying dynamical 
mechanisms. 

Furthermore, if the inference depends on parameters 
which can be adjusted (such as At above), then it is 
sensible to fix these parameters in a way which makes the 
determination of uncertainty about the model as small as 
possible. By the Cramér-Rao bound, this again suggests that
our inference should fall close to critical points. 

At the same time, concluding that a model is close to 
a critical point on the basis of a maximum of the specific 
heat in a plot like the one in Fig. 1 can be misleading.
Indeed the distance from the critical point should be measured
in terms of the number of distinguishable models
which the number of samples allows us to distinguish. It
might be that even if the model is close to a critical point
in the space of $g$ (i.e. $|g - g_c| \ll 1$) there are many models
between $g$ and $g_c$, which are closer to the critical point
and which could, in principle, be distinguished on the
basis of the data.

Even in the simple example presented here, there are 
some collective properties of the inferred states which 
turn out not to correspond to properties of the real sys- 
tem. The proximity to criticality is one such feature, 
since the original model (Hawkes process) does not have 
a proper phase transition. A further spurious feature 
is the fact that, in a region of parameters (large $\Delta t$
and $\alpha > \nu/2$), the inferred model exhibits a double
peaked distribution, but the empirical data is reproduced
by the least probable maximum. This is the analog of
metastability in physics, a phenomenon by which a system
may be driven to attain a phase which is different
from the one which would be stable in those circumstances.
Metastable states usually decay into stable states,
which would lead to the wrong expectation of a sharp
transition in the system of which we are inferring the couplings.
Actually, the distribution for the real system in
this case does not have a second peak corresponding to
that of the inferred model.

The increasing relevance of methods of statistical
learning of high-dimensional models from data in a wide
range of disciplines makes it of utmost importance to
understand which features of the inferred models are induced
solely by the inference scheme and which ones reflect
genuine features of the real system. In this respect,
the understanding of collective behavior of models of statistical
mechanics provides a valuable background. This
is particularly true in the presence of phase transitions
of the associated statistical mechanics model, where the
mapping between microscopic interaction and collective
behavior is no longer single valued. The emphasis which
the study of phase transitions and critical point phenomena
has received in statistical physics assumes a special
relevance for inference, in the light of our findings.